ELT pipeline


ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines

arXiv.org Artificial Intelligence

Practitioners are increasingly turning to Extract-Load-Transform (ELT) pipelines with the widespread adoption of cloud data warehouses. However, designing these pipelines often involves significant manual work to ensure correctness. Recent advances in AI-based methods, which have shown strong capabilities in data tasks, such as text-to-SQL, present an opportunity to alleviate manual efforts in developing ELT pipelines. Unfortunately, current benchmarks in data engineering only evaluate isolated tasks, such as using data tools and writing data transformation queries, leaving a significant gap in evaluating AI agents for generating end-to-end ELT pipelines. To fill this gap, we introduce ELT-Bench, an end-to-end benchmark designed to assess the capabilities of AI agents to build ELT pipelines. ELT-Bench consists of 100 pipelines, including 835 source tables and 203 data models across various domains. By simulating realistic scenarios involving the integration of diverse data sources and the use of popular data tools, ELT-Bench evaluates AI agents' abilities in handling complex data engineering workflows. AI agents must interact with databases and data tools, write code and SQL queries, and orchestrate every pipeline stage. We evaluate two representative code agent frameworks, Spider-Agent and SWE-Agent, using six popular Large Language Models (LLMs) on ELT-Bench. The highest-performing agent, Spider-Agent Claude-3.7-Sonnet with extended thinking, correctly generates only 3.9% of data models, with an average cost of $4.30 and 89.3 steps per pipeline. Our experimental results demonstrate the challenges of ELT-Bench and highlight the need for a more advanced AI agent to reduce manual effort in ELT workflows. Our code and data are available at https://github.com/uiuc-kang-lab/ELT-Bench.


From Data Extraction to Transformation: Creating an ELT Pipeline with Python

#artificialintelligence

Extracting and transforming data is a crucial task in the field of data analytics and data science. The process of extracting data from various sources, transforming it to fit specific business requirements, and loading it into a data warehouse or data lake is commonly known as ETL (Extract, Transform, Load). However, in recent years, a new approach called ELT (Extract, Load, Transform) has emerged, which emphasizes loading data into a target data store before transforming it. In this tutorial, we will walk you through the process of creating an ELT pipeline using Python. The first step is to set up the development environment and install the required dependencies.
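The load-before-transform ordering described above can be sketched in a few lines. This is a minimal illustration, not the tutorial's own code. It assumes an in-memory SQLite database as a stand-in for the target data store, and the table names (`raw_orders`, `customer_totals`), the function `run_elt`, and the sample records are all hypothetical. Only the Python standard library is used.

```python
import sqlite3


def run_elt(rows):
    """Load raw rows into the target store, then transform them there with SQL."""
    # Assumption: an in-memory SQLite database stands in for a data warehouse.
    conn = sqlite3.connect(":memory:")

    # Load: the extracted records are written to a raw table unchanged.
    conn.execute(
        "CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

    # Transform: the reshaping happens inside the store, after loading.
    conn.execute(
        """
        CREATE TABLE customer_totals AS
        SELECT customer, SUM(amount_cents) / 100.0 AS total_amount
        FROM raw_orders
        GROUP BY customer
        """
    )

    result = conn.execute(
        "SELECT customer, total_amount FROM customer_totals ORDER BY customer"
    ).fetchall()
    conn.close()
    return result


# Extract: hypothetical records, as if pulled from a source system.
source_rows = [(1, "alice", 1250), (2, "bob", 800), (3, "alice", 750)]
totals = run_elt(source_rows)
print(totals)
```

In an ETL pipeline the aggregation would instead run in Python before anything is written. Here the raw table is kept in the store, so the transformation can be rerun or revised without extracting the data again.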


How I Redesigned over 100 ETL into ELT Data Pipelines - KDnuggets

#artificialintelligence

Everyone: What do Data Engineers do? Everyone: You mean like a plumber? Data Scientists build models and Data Analysts communicate data to stakeholders. So, what do we need Data Engineers for? Little do they know, without Data Engineers, models won't even exist.


ETL Pipelines with Airflow: the Good, the Bad and the Ugly

#artificialintelligence

Airflow is a popular open-source workflow management platform. Many data teams also use Airflow for their ETL pipelines. For example, I've previously used Airflow transfer operators to replicate data between databases, data lakes and data warehouses. I've also used Airflow transformation operators to preprocess data for machine learning algorithms. But is using Airflow for your ETL pipelines a good practice today?